McCurr Consultancy is an MNC that has thousands of employees spread across the globe. The company believes in hiring the best talent available and retaining them for as long as possible. A huge amount of resources is spent on retaining existing employees through various initiatives. The Head of People Operations wants to bring down the cost of retaining employees. For this, he proposes limiting the incentives to only those employees who are at risk of attrition. As a recently hired Data Scientist in the People Operations Department, you have been asked to identify patterns in characteristics of employees who leave the organization. Also, you have to use this information to predict if an employee is at risk of attrition. This information will be used to target them with incentives.
The data contains demographic details, work-related metrics, and an attrition flag.
In the real world, you will not find definitions for some of your variables. It is a part of the analysis to figure out what they might mean.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
import scipy.stats as stats
from sklearn import metrics
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings('ignore')
hr=pd.read_csv("HR_Employee_Attrition-1.csv")
# copying data to another variable to avoid any changes to the original data
data=hr.copy()
data.head()
| | EmployeeNumber | Attrition | Age | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Yes | 41 | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 2 | No | 49 | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 3 | Yes | 37 | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 4 | No | 33 | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 5 | No | 27 | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
5 rows × 35 columns
data.tail()
| | EmployeeNumber | Attrition | Age | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2935 | 2936 | No | 36 | Travel_Frequently | 884 | Research & Development | 23 | 2 | Medical | 1 | ... | 3 | 80 | 1 | 17 | 3 | 3 | 5 | 2 | 0 | 3 |
| 2936 | 2937 | No | 39 | Travel_Rarely | 613 | Research & Development | 6 | 1 | Medical | 1 | ... | 1 | 80 | 1 | 9 | 5 | 3 | 7 | 7 | 1 | 7 |
| 2937 | 2938 | No | 27 | Travel_Rarely | 155 | Research & Development | 4 | 3 | Life Sciences | 1 | ... | 2 | 80 | 1 | 6 | 0 | 3 | 6 | 2 | 0 | 3 |
| 2938 | 2939 | No | 49 | Travel_Frequently | 1023 | Sales | 2 | 3 | Medical | 1 | ... | 4 | 80 | 0 | 17 | 3 | 2 | 9 | 6 | 0 | 8 |
| 2939 | 2940 | No | 34 | Travel_Rarely | 628 | Research & Development | 8 | 3 | Medical | 1 | ... | 1 | 80 | 0 | 6 | 3 | 4 | 4 | 3 | 1 | 2 |
5 rows × 35 columns
data.shape
(2940, 35)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2940 entries, 0 to 2939
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   EmployeeNumber            2940 non-null   int64
 1   Attrition                 2940 non-null   object
 2   Age                       2940 non-null   int64
 3   BusinessTravel            2940 non-null   object
 4   DailyRate                 2940 non-null   int64
 5   Department                2940 non-null   object
 6   DistanceFromHome          2940 non-null   int64
 7   Education                 2940 non-null   int64
 8   EducationField            2940 non-null   object
 9   EmployeeCount             2940 non-null   int64
 10  EnvironmentSatisfaction   2940 non-null   int64
 11  Gender                    2940 non-null   object
 12  HourlyRate                2940 non-null   int64
 13  JobInvolvement            2940 non-null   int64
 14  JobLevel                  2940 non-null   int64
 15  JobRole                   2940 non-null   object
 16  JobSatisfaction           2940 non-null   int64
 17  MaritalStatus             2940 non-null   object
 18  MonthlyIncome             2940 non-null   int64
 19  MonthlyRate               2940 non-null   int64
 20  NumCompaniesWorked        2940 non-null   int64
 21  Over18                    2940 non-null   object
 22  OverTime                  2940 non-null   object
 23  PercentSalaryHike         2940 non-null   int64
 24  PerformanceRating         2940 non-null   int64
 25  RelationshipSatisfaction  2940 non-null   int64
 26  StandardHours             2940 non-null   int64
 27  StockOptionLevel          2940 non-null   int64
 28  TotalWorkingYears         2940 non-null   int64
 29  TrainingTimesLastYear     2940 non-null   int64
 30  WorkLifeBalance           2940 non-null   int64
 31  YearsAtCompany            2940 non-null   int64
 32  YearsInCurrentRole        2940 non-null   int64
 33  YearsSinceLastPromotion   2940 non-null   int64
 34  YearsWithCurrManager      2940 non-null   int64
dtypes: int64(26), object(9)
memory usage: 804.0+ KB
Observations -
Converting "object" columns to "category" reduces the memory required to store the dataframe.
cols = data.select_dtypes(['object'])
cols.columns
Index(['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender',
'JobRole', 'MaritalStatus', 'Over18', 'OverTime'],
dtype='object')
for i in cols.columns:
    data[i] = data[i].astype('category')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2940 entries, 0 to 2939
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   EmployeeNumber            2940 non-null   int64
 1   Attrition                 2940 non-null   category
 2   Age                       2940 non-null   int64
 3   BusinessTravel            2940 non-null   category
 4   DailyRate                 2940 non-null   int64
 5   Department                2940 non-null   category
 6   DistanceFromHome          2940 non-null   int64
 7   Education                 2940 non-null   int64
 8   EducationField            2940 non-null   category
 9   EmployeeCount             2940 non-null   int64
 10  EnvironmentSatisfaction   2940 non-null   int64
 11  Gender                    2940 non-null   category
 12  HourlyRate                2940 non-null   int64
 13  JobInvolvement            2940 non-null   int64
 14  JobLevel                  2940 non-null   int64
 15  JobRole                   2940 non-null   category
 16  JobSatisfaction           2940 non-null   int64
 17  MaritalStatus             2940 non-null   category
 18  MonthlyIncome             2940 non-null   int64
 19  MonthlyRate               2940 non-null   int64
 20  NumCompaniesWorked        2940 non-null   int64
 21  Over18                    2940 non-null   category
 22  OverTime                  2940 non-null   category
 23  PercentSalaryHike         2940 non-null   int64
 24  PerformanceRating         2940 non-null   int64
 25  RelationshipSatisfaction  2940 non-null   int64
 26  StandardHours             2940 non-null   int64
 27  StockOptionLevel          2940 non-null   int64
 28  TotalWorkingYears         2940 non-null   int64
 29  TrainingTimesLastYear     2940 non-null   int64
 30  WorkLifeBalance           2940 non-null   int64
 31  YearsAtCompany            2940 non-null   int64
 32  YearsInCurrentRole        2940 non-null   int64
 33  YearsSinceLastPromotion   2940 non-null   int64
 34  YearsWithCurrManager      2940 non-null   int64
dtypes: category(9), int64(26)
memory usage: 624.6 KB
We can see that the memory usage has decreased from 804 KB to 624.6 KB. This technique is especially useful for bigger datasets.
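The saving comes from pandas storing a category column as small integer codes plus a single lookup table of the unique labels. A minimal sketch on made-up data illustrating the effect:

```python
import pandas as pd

# hypothetical low-cardinality string column, repeated many times
df = pd.DataFrame({"Dept": ["Sales", "R&D", "HR"] * 10000})

before = df["Dept"].memory_usage(deep=True)
df["Dept"] = df["Dept"].astype("category")  # integer codes + 3-entry lookup table
after = df["Dept"].memory_usage(deep=True)

print(before, after)  # the category column is substantially smaller
```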
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| EmployeeNumber | 2940.0 | 1470.500000 | 848.849221 | 1.0 | 735.75 | 1470.5 | 2205.25 | 2940.0 |
| Age | 2940.0 | 36.923810 | 9.133819 | 18.0 | 30.00 | 36.0 | 43.00 | 60.0 |
| DailyRate | 2940.0 | 802.485714 | 403.440447 | 102.0 | 465.00 | 802.0 | 1157.00 | 1499.0 |
| DistanceFromHome | 2940.0 | 9.192517 | 8.105485 | 1.0 | 2.00 | 7.0 | 14.00 | 29.0 |
| Education | 2940.0 | 2.912925 | 1.023991 | 1.0 | 2.00 | 3.0 | 4.00 | 5.0 |
| EmployeeCount | 2940.0 | 1.000000 | 0.000000 | 1.0 | 1.00 | 1.0 | 1.00 | 1.0 |
| EnvironmentSatisfaction | 2940.0 | 2.721769 | 1.092896 | 1.0 | 2.00 | 3.0 | 4.00 | 4.0 |
| HourlyRate | 2940.0 | 65.891156 | 20.325969 | 30.0 | 48.00 | 66.0 | 84.00 | 100.0 |
| JobInvolvement | 2940.0 | 2.729932 | 0.711440 | 1.0 | 2.00 | 3.0 | 3.00 | 4.0 |
| JobLevel | 2940.0 | 2.063946 | 1.106752 | 1.0 | 1.00 | 2.0 | 3.00 | 5.0 |
| JobSatisfaction | 2940.0 | 2.728571 | 1.102658 | 1.0 | 2.00 | 3.0 | 4.00 | 4.0 |
| MonthlyIncome | 2940.0 | 6502.931293 | 4707.155770 | 1009.0 | 2911.00 | 4919.0 | 8380.00 | 19999.0 |
| MonthlyRate | 2940.0 | 14313.103401 | 7116.575021 | 2094.0 | 8045.00 | 14235.5 | 20462.00 | 26999.0 |
| NumCompaniesWorked | 2940.0 | 2.693197 | 2.497584 | 0.0 | 1.00 | 2.0 | 4.00 | 9.0 |
| PercentSalaryHike | 2940.0 | 15.209524 | 3.659315 | 11.0 | 12.00 | 14.0 | 18.00 | 25.0 |
| PerformanceRating | 2940.0 | 3.153741 | 0.360762 | 3.0 | 3.00 | 3.0 | 3.00 | 4.0 |
| RelationshipSatisfaction | 2940.0 | 2.712245 | 1.081025 | 1.0 | 2.00 | 3.0 | 4.00 | 4.0 |
| StandardHours | 2940.0 | 80.000000 | 0.000000 | 80.0 | 80.00 | 80.0 | 80.00 | 80.0 |
| StockOptionLevel | 2940.0 | 0.793878 | 0.851932 | 0.0 | 0.00 | 1.0 | 1.00 | 3.0 |
| TotalWorkingYears | 2940.0 | 11.279592 | 7.779458 | 0.0 | 6.00 | 10.0 | 15.00 | 40.0 |
| TrainingTimesLastYear | 2940.0 | 2.799320 | 1.289051 | 0.0 | 2.00 | 3.0 | 3.00 | 6.0 |
| WorkLifeBalance | 2940.0 | 2.761224 | 0.706356 | 1.0 | 2.00 | 3.0 | 3.00 | 4.0 |
| YearsAtCompany | 2940.0 | 7.008163 | 6.125483 | 0.0 | 3.00 | 5.0 | 9.00 | 40.0 |
| YearsInCurrentRole | 2940.0 | 4.229252 | 3.622521 | 0.0 | 2.00 | 3.0 | 7.00 | 18.0 |
| YearsSinceLastPromotion | 2940.0 | 2.187755 | 3.221882 | 0.0 | 0.00 | 1.0 | 3.00 | 15.0 |
| YearsWithCurrManager | 2940.0 | 4.123129 | 3.567529 | 0.0 | 2.00 | 3.0 | 7.00 | 17.0 |
data.describe(include=['category']).T
| | count | unique | top | freq |
|---|---|---|---|---|
| Attrition | 2940 | 2 | No | 2466 |
| BusinessTravel | 2940 | 3 | Travel_Rarely | 2086 |
| Department | 2940 | 3 | Research & Development | 1922 |
| EducationField | 2940 | 6 | Life Sciences | 1212 |
| Gender | 2940 | 2 | Male | 1764 |
| JobRole | 2940 | 9 | Sales Executive | 652 |
| MaritalStatus | 2940 | 3 | Married | 1346 |
| Over18 | 2940 | 1 | Y | 2940 |
| OverTime | 2940 | 2 | No | 2108 |
Dropping columns that add no information: identifier and constant-valued columns.
data.drop(['EmployeeNumber','EmployeeCount','StandardHours','Over18'],axis=1,inplace=True)
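The dropped columns were spotted from describe(): EmployeeNumber is a row identifier, while EmployeeCount, StandardHours, and Over18 are constant. Constant columns can also be found programmatically; a small sketch on made-up data:

```python
import pandas as pd

# toy frame with two constant columns (hypothetical values)
df = pd.DataFrame({
    "EmployeeCount": [1, 1, 1],
    "Age": [41, 49, 37],
    "Over18": ["Y", "Y", "Y"],
})

constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
print(constant_cols)  # ['EmployeeCount', 'Over18']
```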
Let's look at the unique values of all the categorical columns.
cols_cat= data.select_dtypes(['category'])
for i in cols_cat.columns:
    print('Unique values in', i, 'are :')
    print(cols_cat[i].value_counts())
    print('*' * 50)
Unique values in Attrition are :
No     2466
Yes     474
Name: Attrition, dtype: int64
**************************************************
Unique values in BusinessTravel are :
Travel_Rarely        2086
Travel_Frequently     554
Non-Travel            300
Name: BusinessTravel, dtype: int64
**************************************************
Unique values in Department are :
Research & Development    1922
Sales                      892
Human Resources            126
Name: Department, dtype: int64
**************************************************
Unique values in EducationField are :
Life Sciences       1212
Medical              928
Marketing            318
Technical Degree     264
Other                164
Human Resources       54
Name: EducationField, dtype: int64
**************************************************
Unique values in Gender are :
Male      1764
Female    1176
Name: Gender, dtype: int64
**************************************************
Unique values in JobRole are :
Sales Executive              652
Research Scientist           584
Laboratory Technician        518
Manufacturing Director       290
Healthcare Representative    262
Manager                      204
Sales Representative         166
Research Director            160
Human Resources              104
Name: JobRole, dtype: int64
**************************************************
Unique values in MaritalStatus are :
Married     1346
Single       940
Divorced     654
Name: MaritalStatus, dtype: int64
**************************************************
Unique values in OverTime are :
No     2108
Yes     832
Name: OverTime, dtype: int64
**************************************************
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # for histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage
    plt.show()  # show the plot
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()
Data Cleaning:
Observations from EDA:
# creating histograms
data.hist(figsize=(14, 14))
plt.show()
- Age: approximately normally distributed with a mean of about 37 years. Age is positively correlated with JobLevel and Education (i.e., older employees tend to be more educated and at a higher job level).
- DailyRate: fairly uniform distribution with an average of about 800. Employees with lower daily rates and lower monthly income are more likely to attrite.
- DistanceFromHome: right-skewed distribution. Most employees live close to work, with a few living far away. Employees with longer commutes attrite more.
- HourlyRate: almost uniformly distributed, with mean and median both approximately 65. The hourly rate doesn't have much impact on attrition.
- MonthlyIncome: right-skewed; a few employees earn much more than the rest. MonthlyIncome is highly correlated with JobLevel (0.95).
- MonthlyRate: uniformly distributed with a median close to 14,500. The monthly rate doesn't have much impact on attrition.
- NumCompaniesWorked: employees have worked at 2.5 companies on average, with a median of 2. The majority have worked at only one company, and approximately 350 are freshers. The few employees who have worked at 9 companies are outliers.
- PercentSalaryHike: right-skewed distribution; correlated with PerformanceRating with a coefficient of 0.77. Smaller salary hikes also contribute to attrition.
- TotalWorkingYears: significantly right-skewed, with a few outliers.
- YearsAtCompany: significantly right-skewed; also contains outliers.
- YearsInCurrentRole: has a few outliers; the lower whisker coincides with the first quartile.
- YearsWithCurrManager: right-skewed, with a few outliers.
- BusinessTravel: almost 71% of employees travel rarely and 19% travel frequently. Attrition rises with travel frequency; employees who travel frequently have a ~22% probability of attriting.
- Department: the R&D department accounts for almost 65% of employees.
- EducationField: Life Sciences is the dominant background, with almost 41% of employees.
- Gender: 60% of employees are male, the rest female.
- JobRole: almost 22% of employees are Sales Executives, followed by 20% Research Scientists.
- MaritalStatus: almost 46% of employees are married and 32% are single.
- OverTime: only about 29% of employees work overtime, and they tend to attrite more, with a ~35% probability of attrition.
- Attrition: the data is imbalanced, with 16% of the employees attriting and the rest not.

Bivariate Analysis
cols = data[['DailyRate','HourlyRate','MonthlyRate','MonthlyIncome','PercentSalaryHike']].columns.tolist()
plt.figure(figsize=(10,10))
for i, variable in enumerate(cols):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(x=data["Attrition"], y=data[variable], palette="PuBu")
    plt.tight_layout()
    plt.title(variable)
plt.show()
Attrition vs Earnings of employee:
Attrition vs Years working in company
Attrition vs Previous job roles
Attrition vs Age and Education
cols = data[['Age','DistanceFromHome','Education']].columns.tolist()
plt.figure(figsize=(10,10))
for i, variable in enumerate(cols):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(x=data["Attrition"], y=data[variable], palette="PuBu")
    plt.tight_layout()
    plt.title(variable)
plt.show()
There's a ~40% probability of attrition among employees with low ratings for environment satisfaction.
Job Involvement looks like a very strong indicator of attrition.
Further investigation to understand how this variable was collected will give more insights.
Salary hikes are a function of Performance ratings.
stacked_barplot(data,"OverTime","Attrition")
Attrition     No  Yes   All
OverTime
All         2466  474  2940
Yes          578  254   832
No          1888  220  2108
------------------------------------------------------------------------------------------------------------------------
To preserve the class ratio in both splits, we will use the stratify parameter in the train_test_split function.
X = data.drop(['Attrition'],axis=1)
X = pd.get_dummies(X,drop_first=True)
y = data['Attrition'].apply(lambda x : 1 if x=='Yes' else 0)
# Splitting data into training and test set:
X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.3, random_state=1,stratify=y)
print(X_train.shape, X_test.shape)
(2058, 44) (882, 44)
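pd.get_dummies with drop_first=True one-hot encodes each categorical column and drops the first (alphabetical) level, avoiding perfectly collinear dummy columns; that is how the 30 predictor columns become 44. A small sketch with a made-up column:

```python
import pandas as pd

# a single hypothetical categorical column
df = pd.DataFrame({"MaritalStatus": ["Single", "Married", "Divorced", "Single"]})

dummies = pd.get_dummies(df, drop_first=True)  # drops 'Divorced', the first level
print(list(dummies.columns))  # ['MaritalStatus_Married', 'MaritalStatus_Single']
```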
y.value_counts(1)
0    0.838776
1    0.161224
Name: Attrition, dtype: float64
y_test.value_counts(1)
0    0.839002
1    0.160998
Name: Attrition, dtype: float64
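The near-identical class ratios above come from stratify=y: the split preserves the class proportions in both subsets. A sketch on synthetic labels with a similar ~16% positive rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 84 + [1] * 16)     # ~16% positives, like Attrition
X = np.arange(len(y)).reshape(-1, 1)  # a single dummy feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(y_tr.mean(), y_te.mean())  # both close to the overall 0.16 rate
```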
Let's define a function to provide metric scores (accuracy, recall, precision, and F1) on the train and test sets, and a function to show the confusion matrix, so that we do not have to repeat the same code while evaluating the models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
If the frequency of class A is 10% and the frequency of class B is 90%, then class B becomes the dominant class and the decision tree grows biased toward it.
In this case, we can pass a dictionary {0:0.17,1:0.83} to the model to specify the weight of each class, so the decision tree gives more weight to class 1 (the attriting employees).
class_weight is a hyperparameter of the decision tree classifier.
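The {0:0.17, 1:0.83} weights roughly invert the observed class frequencies (~84%/16%). scikit-learn can derive such weights automatically with class_weight='balanced', which uses n_samples / (n_classes * class_count); a sketch:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 84 + [1] * 16)  # imbalance similar to Attrition
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
# balanced weight for a class = n_samples / (n_classes * class_count)
print(dict(zip([0, 1], weights)))  # the minority class gets the larger weight
```

Passing class_weight='balanced' to DecisionTreeClassifier performs the same computation internally.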
dtree = DecisionTreeClassifier(criterion='gini',class_weight={0:0.17,1:0.83},random_state=1)
dtree.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.17, 1: 0.83}, random_state=1)
confusion_matrix_sklearn(dtree, X_test, y_test)
Confusion Matrix -
True Positive (observed=1, predicted=1): the employee left and the model correctly predicted attrition.
False Positive (observed=0, predicted=1): the employee didn't leave but the model predicted attrition.
True Negative (observed=0, predicted=0): the employee didn't leave and the model predicted no attrition.
False Negative (observed=1, predicted=0): the employee left but the model predicted no attrition.
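These four outcomes map onto sklearn's confusion_matrix, whose rows and columns are ordered [0, 1], so .ravel() on a binary matrix returns (tn, fp, fn, tp); a small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 0, 1, 0, 1]  # hypothetical observed attrition
y_pred = [1, 0, 1, 0, 0, 1]  # hypothetical predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```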
dtree_model_train_perf=model_performance_classification_sklearn(dtree, X_train, y_train)
print("Training performance \n",dtree_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
dtree_model_test_perf=model_performance_classification_sklearn(dtree, X_test, y_test)
print("Testing performance \n",dtree_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.936508 0.830986 0.786667 0.808219
bagging = BaggingClassifier(random_state=1)
bagging.fit(X_train,y_train)
BaggingClassifier(random_state=1)
confusion_matrix_sklearn(bagging, X_test, y_test)
bagging_model_train_perf=model_performance_classification_sklearn(bagging, X_train, y_train)
print("Training performance \n",bagging_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 0.993197 0.957831 1.0 0.978462
bagging_model_test_perf=model_performance_classification_sklearn(bagging, X_test, y_test)
print("Testing performance \n",bagging_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.945578 0.704225 0.943396 0.806452
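Because each bootstrap sample leaves out roughly a third of the rows, bagging can also report an out-of-bag score, a built-in validation estimate that needs no separate holdout. A sketch on synthetic data (parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, random_state=1)  # synthetic data
bag = BaggingClassifier(n_estimators=50, oob_score=True, random_state=1)
bag.fit(X, y)
print(bag.oob_score_)  # accuracy estimated on the out-of-bag samples
```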
Bagging Classifier with weighted decision tree
bagging_wt = BaggingClassifier(base_estimator=DecisionTreeClassifier(criterion='gini',class_weight={0:0.17,1:0.83},random_state=1),random_state=1)
bagging_wt.fit(X_train,y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.17,
1: 0.83},
random_state=1),
random_state=1)
confusion_matrix_sklearn(bagging_wt,X_test,y_test)
bagging_wt_model_train_perf=model_performance_classification_sklearn(bagging_wt,X_train,y_train)
print("Training performance \n",bagging_wt_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 0.993197 0.957831 1.0 0.978462
bagging_wt_model_test_perf=model_performance_classification_sklearn(bagging_wt, X_test, y_test)
print("Testing performance \n",bagging_wt_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.943311 0.704225 0.925926 0.8
rf = RandomForestClassifier(random_state=1)
rf.fit(X_train,y_train)
RandomForestClassifier(random_state=1)
confusion_matrix_sklearn(rf,X_test,y_test)
rf_model_train_perf=model_performance_classification_sklearn(rf,X_train,y_train)
print("Training performance \n",rf_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
rf_model_test_perf=model_performance_classification_sklearn(rf,X_test,y_test)
print("Testing performance \n",rf_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.961451 0.802817 0.95 0.870229
Random forest with class weights
rf_wt = RandomForestClassifier(class_weight={0:0.17,1:0.83}, random_state=1)
rf_wt.fit(X_train,y_train)
RandomForestClassifier(class_weight={0: 0.17, 1: 0.83}, random_state=1)
confusion_matrix_sklearn(rf_wt, X_test,y_test)
rf_wt_model_train_perf=model_performance_classification_sklearn(rf_wt, X_train,y_train)
print("Training performance \n",rf_wt_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
rf_wt_model_test_perf=model_performance_classification_sklearn(rf_wt, X_test,y_test)
print("Testing performance \n",rf_wt_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.961451 0.788732 0.965517 0.868217
Tuning Decision Tree
# Choose the type of classifier.
dtree_estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2,30),
'min_samples_leaf': [1, 2, 5, 7, 10],
'max_leaf_nodes' : [2, 3, 5, 10,15, None],
'min_impurity_decrease': [0.0001,0.001,0.01,0.1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
dtree_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
dtree_estimator.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=19, min_impurity_decrease=0.0001,
random_state=1)
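The tuning pattern used here (build a recall scorer, run the grid search, take best_estimator_) is generic; a minimal sketch on synthetic data with a deliberately tiny grid:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)  # synthetic data
scorer = make_scorer(recall_score)  # optimise recall, as in the search above

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [2, 4, 6]},  # illustrative grid
    scoring=scorer,
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
best_model = grid.best_estimator_  # already refit on all of X, y (refit=True)
```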
confusion_matrix_sklearn(dtree_estimator, X_test,y_test)
dtree_estimator_model_train_perf=model_performance_classification_sklearn(dtree_estimator, X_train,y_train)
print("Training performance \n",dtree_estimator_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 0.996599 0.978916 1.0 0.989346
dtree_estimator_model_test_perf=model_performance_classification_sklearn(dtree_estimator, X_test, y_test)
print("Testing performance \n",dtree_estimator_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.939909 0.809859 0.815603 0.812721
Tuning Bagging Classifier
# grid search for bagging classifier
cl1 = DecisionTreeClassifier(random_state=1)
param_grid = {'base_estimator':[cl1],
'n_estimators':[5,7,15,51,101],
'max_features': [0.7,0.8,0.9,1]
}
grid = GridSearchCV(BaggingClassifier(random_state=1,bootstrap=True), param_grid=param_grid, scoring = 'recall', cv = 5)
grid.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=BaggingClassifier(random_state=1),
param_grid={'base_estimator': [DecisionTreeClassifier(random_state=1)],
'max_features': [0.7, 0.8, 0.9, 1],
'n_estimators': [5, 7, 15, 51, 101]},
scoring='recall')
## getting the best estimator
bagging_estimator = grid.best_estimator_
bagging_estimator.fit(X_train,y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1),
max_features=0.9, n_estimators=51, random_state=1)
confusion_matrix_sklearn(bagging_estimator, X_test,y_test)
bagging_estimator_model_train_perf=model_performance_classification_sklearn(bagging_estimator, X_train,y_train)
print("Training performance \n",bagging_estimator_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
bagging_estimator_model_test_perf=model_performance_classification_sklearn(bagging_estimator, X_test, y_test)
print("Testing performance \n",bagging_estimator_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.959184 0.816901 0.920635 0.865672
Tuning Random Forest
# Choose the type of classifier.
rf_estimator = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [110,251,501],
"min_samples_leaf": np.arange(1, 6,1),
"max_features": [0.7,0.9,'log2','auto'],
"max_samples": [0.7,0.9,None],
}
# Run the grid search
grid_obj = GridSearchCV(rf_estimator, parameters, scoring='recall',cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_estimator.fit(X_train, y_train)
RandomForestClassifier(max_features=0.9, max_samples=0.9, n_estimators=110,
random_state=1)
confusion_matrix_sklearn(rf_estimator, X_test,y_test)
rf_estimator_model_train_perf=model_performance_classification_sklearn(rf_estimator, X_train,y_train)
print("Training performance \n",rf_estimator_model_train_perf)
Training performance
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
rf_estimator_model_test_perf=model_performance_classification_sklearn(rf_estimator, X_test, y_test)
print("Testing performance \n",rf_estimator_model_test_perf)
Testing performance
Accuracy Recall Precision F1
0 0.961451 0.802817 0.95 0.870229
# training performance comparison
models_train_comp_df = pd.concat(
[dtree_model_train_perf.T,bagging_model_train_perf.T, bagging_wt_model_train_perf.T,rf_model_train_perf.T,
rf_wt_model_train_perf.T,dtree_estimator_model_train_perf.T, bagging_estimator_model_train_perf.T,
rf_estimator_model_train_perf.T],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree",
"Bagging Classifier",
"Weighted Bagging Classifier",
"Random Forest Classifier",
"Weighted Random Forest Classifier",
"Decision Tree Estimator",
"Bagging Estimator",
"Random Forest Estimator"]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree | Bagging Classifier | Weighted Bagging Classifier | Random Forest Classifier | Weighted Random Forest Classifier | Decision Tree Estimator | Bagging Estimator | Random Forest Estimator |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 1.0 | 0.993197 | 0.993197 | 1.0 | 1.0 | 0.996599 | 1.0 | 1.0 |
| Recall | 1.0 | 0.957831 | 0.957831 | 1.0 | 1.0 | 0.978916 | 1.0 | 1.0 |
| Precision | 1.0 | 1.000000 | 1.000000 | 1.0 | 1.0 | 1.000000 | 1.0 | 1.0 |
| F1 | 1.0 | 0.978462 | 0.978462 | 1.0 | 1.0 | 0.989346 | 1.0 | 1.0 |
# test performance comparison
models_test_comp_df = pd.concat(
[dtree_model_test_perf.T,bagging_model_test_perf.T, bagging_wt_model_test_perf.T,rf_model_test_perf.T,
rf_wt_model_test_perf.T,dtree_estimator_model_test_perf.T, bagging_estimator_model_test_perf.T,
rf_estimator_model_test_perf.T],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree",
"Bagging Classifier",
"Weighted Bagging Classifier",
"Random Forest Classifier",
"Weighted Random Forest Classifier",
"Decision Tree Estimator",
"Bagging Estimator",
"Random Forest Estimator"]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
| | Decision Tree | Bagging Classifier | Weighted Bagging Classifier | Random Forest Classifier | Weighted Random Forest Classifier | Decision Tree Estimator | Bagging Estimator | Random Forest Estimator |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.936508 | 0.945578 | 0.943311 | 0.961451 | 0.961451 | 0.939909 | 0.959184 | 0.961451 |
| Recall | 0.830986 | 0.704225 | 0.704225 | 0.802817 | 0.788732 | 0.809859 | 0.816901 | 0.802817 |
| Precision | 0.786667 | 0.943396 | 0.925926 | 0.950000 | 0.965517 | 0.815603 | 0.920635 | 0.950000 |
| F1 | 0.808219 | 0.806452 | 0.800000 | 0.870229 | 0.868217 | 0.812721 | 0.865672 | 0.870229 |
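Since the goal is to catch employees at risk of attrition, recall on the test set is the metric to watch when ranking models. A small sketch of turning the comparison table into a model choice (test-set numbers reproduced from the table above; only three models shown for brevity):

```python
# Pick the model with the highest test recall from the comparison table.
# The numbers below are copied from the test performance table above.
import pandas as pd

models_test_comp_df = pd.DataFrame(
    {
        "Decision Tree": [0.936508, 0.830986, 0.786667, 0.808219],
        "Random Forest Classifier": [0.961451, 0.802817, 0.950000, 0.870229],
        "Bagging Estimator": [0.959184, 0.816901, 0.920635, 0.865672],
    },
    index=["Accuracy", "Recall", "Precision", "F1"],
)

# idxmax over the Recall row gives the model with the highest test recall
best_by_recall = models_test_comp_df.loc["Recall"].idxmax()
print(best_by_recall)
```

Here the plain Decision Tree has the highest raw test recall, while the tuned Random Forest trades a little recall for much better precision and F1 — which model to deploy depends on how costly a missed at-risk employee is relative to a wasted incentive.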
# Importance of features in the tree building: the importance of a feature is
# computed as the (normalized) total reduction of the criterion brought by that
# feature. It is also known as the Gini importance.
print(
    pd.DataFrame(rf.feature_importances_, columns=["Imp"], index=X_train.columns)
    .sort_values(by="Imp", ascending=False)
)
                                        Imp
MonthlyIncome                      0.080810
OverTime_Yes                       0.059174
Age                                0.055292
DailyRate                          0.053302
TotalWorkingYears                  0.052114
HourlyRate                         0.050174
MonthlyRate                        0.048463
DistanceFromHome                   0.047056
YearsAtCompany                     0.039540
PercentSalaryHike                  0.031674
YearsWithCurrManager               0.031058
YearsInCurrentRole                 0.029408
NumCompaniesWorked                 0.029030
TrainingTimesLastYear              0.028230
EnvironmentSatisfaction            0.026136
StockOptionLevel                   0.025781
JobInvolvement                     0.025729
JobSatisfaction                    0.025468
WorkLifeBalance                    0.025087
JobLevel                           0.023950
YearsSinceLastPromotion            0.023892
Education                          0.021817
RelationshipSatisfaction           0.021528
MaritalStatus_Single               0.013513
BusinessTravel_Travel_Frequently   0.012322
Gender_Male                        0.009384
MaritalStatus_Married              0.009269
JobRole_Research Scientist         0.008532
Department_Research & Development  0.008405
BusinessTravel_Travel_Rarely       0.008310
EducationField_Medical             0.007918
EducationField_Life Sciences       0.007847
EducationField_Technical Degree    0.007828
JobRole_Sales Representative       0.007542
Department_Sales                   0.007284
EducationField_Marketing           0.007281
JobRole_Laboratory Technician      0.007070
JobRole_Sales Executive            0.005483
PerformanceRating                  0.004814
JobRole_Human Resources            0.004303
JobRole_Manufacturing Director     0.003016
EducationField_Other               0.002766
JobRole_Manager                    0.001687
JobRole_Research Director          0.000715
feature_names = X_train.columns
importances = rf_estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
We have been able to build a predictive model:
a) that the company can deploy to identify employees who are at risk of attrition.
b) that the company can use to find the drivers of attrition.
c) based on which the company can take appropriate actions to build better retention policies.
Factors that drive attrition - Monthly Income, Overtime, and Age.
Monthly Income: Employees with lower income attrite more, which is also logical as they might get offers with higher pay in different organizations - the company should make sure that all employees are compensated based on industry standards.
Overtime: Those employees who have to work overtime are the ones who attrite more - the company can provide some additional incentives to such employees to retain them.
Age: Younger employees are the ones that attrite more - the company can make sure that new joiners have a friendly environment and better opportunities to excel in their careers.
Distance from home is also an important factor for attrition - employees who commute longer distances to the workplace are more likely to attrite. For such employees, the company can provide cab facilities to ease their commute.
As work-related travel frequency increases, the attrition rate also increases - the company should review its travel policy and consider reducing travel frequency or offering additional support to frequent travelers.
Training doesn't seem to have an impact on attrition - the company needs to investigate further here; if training does not affect employee retention, better cost planning can be done.
Employees with more experience, and those who have worked for the company the longest, are generally loyal and do not attrite.
Attrition is highest in the Sales department - more research should go into finding out what is wrong there.
Our data collection technique is working well: the ratings given by employees for Environment Satisfaction, Job Satisfaction, Relationship Satisfaction, and Work-Life Balance show a significant difference between attriting and non-attriting employees. These scales can act as a preliminary step to understand employee dissatisfaction - the lower the rating, the higher the chances of attrition.
histogram_boxplot(data,'Age')
histogram_boxplot(data,'DailyRate')
histogram_boxplot(data,'DistanceFromHome')
histogram_boxplot(data,'HourlyRate')
histogram_boxplot(data,'MonthlyIncome')
histogram_boxplot(data,'MonthlyRate')
histogram_boxplot(data,'NumCompaniesWorked')
histogram_boxplot(data,'PercentSalaryHike')
histogram_boxplot(data,'TotalWorkingYears')
histogram_boxplot(data,'YearsAtCompany')
histogram_boxplot(data,'YearsInCurrentRole')
histogram_boxplot(data,'YearsSinceLastPromotion')
histogram_boxplot(data,'YearsWithCurrManager')
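The `histogram_boxplot` calls above assume a helper defined earlier in the notebook. A minimal matplotlib-only sketch of such a helper (the original likely uses seaborn; the signature below is an assumption):

```python
# Hypothetical sketch: boxplot on top, histogram with a mean line below,
# sharing the x-axis, as is common for this kind of univariate EDA helper.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

def histogram_boxplot(data, feature, bins=None, figsize=(12, 7)):
    """Draw a combined boxplot + histogram for one numeric column."""
    fig, (ax_box, ax_hist) = plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    values = data[feature].dropna()
    ax_box.boxplot(values, vert=False)        # distribution summary on top
    ax_hist.hist(values, bins=bins)           # full distribution below
    ax_hist.axvline(values.mean(), color="green", linestyle="--")  # mean marker
    ax_box.set_title(feature)
    return fig
```

Called as above, e.g. `histogram_boxplot(data, 'Age')`, it lets the skew and outliers of each numeric feature be read from a single figure.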
labeled_barplot(data, "BusinessTravel", perc=True)
labeled_barplot(data, "Department", perc=True)
labeled_barplot(data, "EducationField", perc=True)
labeled_barplot(data, "Gender", perc=True)
labeled_barplot(data, "JobRole", perc=True)
labeled_barplot(data, "MaritalStatus", perc=True)
labeled_barplot(data, "OverTime", perc=True)
labeled_barplot(data, "Attrition", perc=True)
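`labeled_barplot` is likewise assumed from earlier in the notebook. A sketch that annotates each bar with its count, or with its percentage when `perc=True` (names and signature are assumptions):

```python
# Hypothetical sketch: bar plot of category counts with a label on each bar.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

def labeled_barplot(data, feature, perc=False):
    """Bar plot of category counts, each bar labeled with count or percent."""
    counts = data[feature].value_counts()
    total = len(data[feature])
    fig, ax = plt.subplots(figsize=(10, 5))
    bars = ax.bar(counts.index.astype(str), counts.values)
    for bar, count in zip(bars, counts.values):
        label = f"{100 * count / total:.1f}%" if perc else str(count)
        ax.annotate(
            label,
            (bar.get_x() + bar.get_width() / 2, bar.get_height()),
            ha="center",
            va="bottom",
        )
    ax.set_xlabel(feature)
    return fig
```

With `perc=True`, as used above, the labels make class imbalance visible at a glance - e.g. the `Attrition` plot shows directly what fraction of employees left.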
plt.figure(figsize=(20,10))
sns.heatmap(data.corr(numeric_only=True),annot=True,vmin=-1,vmax=1,fmt='.2f',cmap="Spectral")
plt.show()
sns.pairplot(data,hue='Attrition')
plt.show()
cols = data[['DailyRate','HourlyRate','MonthlyRate','MonthlyIncome','PercentSalaryHike']].columns.tolist()
plt.figure(figsize=(10,10))
for i, variable in enumerate(cols):
plt.subplot(3,2,i+1)
    sns.boxplot(x="Attrition", y=variable, data=data, palette="PuBu")
plt.tight_layout()
plt.title(variable)
plt.show()
cols = data[['YearsAtCompany', 'YearsInCurrentRole',
'YearsSinceLastPromotion', 'YearsWithCurrManager','TrainingTimesLastYear']].columns.tolist()
plt.figure(figsize=(10,10))
for i, variable in enumerate(cols):
plt.subplot(3,2,i+1)
    sns.boxplot(x="Attrition", y=variable, data=data, palette="PuBu")
plt.tight_layout()
plt.title(variable)
plt.show()
cols = data[['NumCompaniesWorked','TotalWorkingYears']].columns.tolist()
plt.figure(figsize=(10,10))
for i, variable in enumerate(cols):
plt.subplot(3,2,i+1)
    sns.boxplot(x="Attrition", y=variable, data=data, palette="PuBu")
plt.tight_layout()
plt.title(variable)
plt.show()
cols = data[['Age','DistanceFromHome','Education']].columns.tolist()
plt.figure(figsize=(10,10))
for i, variable in enumerate(cols):
plt.subplot(3,2,i+1)
    sns.boxplot(x="Attrition", y=variable, data=data, palette="PuBu")
plt.tight_layout()
plt.title(variable)
plt.show()
stacked_barplot(data, "BusinessTravel", "Attrition")
Attrition            No  Yes   All
BusinessTravel
All                2466  474  2940
Travel_Rarely      1774  312  2086
Travel_Frequently   416  138   554
Non-Travel          276   24   300
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Department", "Attrition")
Attrition                 No  Yes   All
Department
All                     2466  474  2940
Research & Development  1656  266  1922
Sales                    708  184   892
Human Resources          102   24   126
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data,"EducationField","Attrition")
Attrition           No  Yes   All
EducationField
All               2466  474  2940
Life Sciences     1034  178  1212
Medical            802  126   928
Marketing          248   70   318
Technical Degree   200   64   264
Other              142   22   164
Human Resources     40   14    54
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data,"EnvironmentSatisfaction","Attrition")
Attrition                  No  Yes   All
EnvironmentSatisfaction
All                      2466  474  2940
1                         424  144   568
3                         782  124   906
4                         772  120   892
2                         488   86   574
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data,"JobInvolvement","Attrition")
Attrition         No  Yes   All
JobInvolvement
All             2466  474  2940
3               1486  250  1736
2                608  142   750
1                110   56   166
4                262   26   288
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data,"JobLevel","Attrition")
Attrition    No  Yes   All
JobLevel
All        2466  474  2940
1           800  286  1086
2           964  104  1068
3           372   64   436
4           202   10   212
5           128   10   138
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data,"JobRole","Attrition")
Attrition                    No  Yes   All
JobRole
All                        2466  474  2940
Laboratory Technician       394  124   518
Sales Executive             538  114   652
Research Scientist          490   94   584
Sales Representative        100   66   166
Human Resources              80   24   104
Manufacturing Director      270   20   290
Healthcare Representative   244   18   262
Manager                     194   10   204
Research Director           156    4   160
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data,"JobSatisfaction","Attrition")
Attrition          No  Yes   All
JobSatisfaction
All              2466  474  2940
3                 738  146   884
1                 446  132   578
4                 814  104   918
2                 468   92   560
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data,"MaritalStatus","Attrition")
Attrition        No  Yes   All
MaritalStatus
All            2466  474  2940
Single          700  240   940
Married        1178  168  1346
Divorced        588   66   654
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data,"OverTime","Attrition")
Attrition    No  Yes   All
OverTime
All        2466  474  2940
Yes         578  254   832
No         1888  220  2108
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data,"RelationshipSatisfaction","Attrition")
Attrition                   No  Yes   All
RelationshipSatisfaction
All                       2466  474  2940
3                          776  142   918
4                          736  128   864
1                          438  114   552
2                          516   90   606
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data,"StockOptionLevel","Attrition")
Attrition           No  Yes   All
StockOptionLevel
All               2466  474  2940
0                  954  308  1262
1                 1080  112  1192
3                  140   30   170
2                  292   24   316
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data,"WorkLifeBalance","Attrition")
Attrition          No  Yes   All
WorkLifeBalance
All              2466  474  2940
3                1532  254  1786
2                 572  116   688
4                 252   54   306
1                 110   50   160
------------------------------------------------------------------------------------------------------------------------
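`stacked_barplot` is the third assumed helper: judging by the printed outputs, it shows the predictor-vs-`Attrition` crosstab (with margins, sorted by attrition count) and draws a row-normalized stacked bar chart. A pandas/matplotlib sketch under those assumptions:

```python
# Hypothetical sketch of stacked_barplot: print the crosstab with margins,
# sorted by the positive-class column, then plot row-normalized proportions.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

def stacked_barplot(data, predictor, target):
    """Print the predictor-vs-target crosstab and plot it normalized by row."""
    tab_counts = pd.crosstab(data[predictor], data[target], margins=True)
    # columns[-2] is the last real target class (e.g. "Yes"); columns[-1] is "All"
    print(tab_counts.sort_values(by=tab_counts.columns[-2], ascending=False))
    tab_norm = pd.crosstab(data[predictor], data[target], normalize="index")
    ax = tab_norm.plot(kind="bar", stacked=True, figsize=(10, 5))
    return ax
```

Normalizing by row is what makes categories of different sizes comparable - e.g. the `OverTime` plot shows the attrition *rate* among overtime workers, not just their raw counts.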
Checking if performance rating and salary hike are related:
sns.boxplot(x="PerformanceRating", y="PercentSalaryHike", data=data)
plt.show()
plt.figure(figsize=(15,5))
sns.boxplot(x="PerformanceRating", y="PercentSalaryHike", hue="JobRole", data=data)
plt.show()
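The same relationship can be checked numerically with a groupby summary of the hike per rating. A sketch with illustrative values (the numbers below are made up, not from the real dataset):

```python
# Summarize PercentSalaryHike per PerformanceRating level.
import pandas as pd

# Illustrative stand-in for the real `data` DataFrame used above.
data = pd.DataFrame({
    "PerformanceRating": [3, 3, 4, 4, 3, 4],
    "PercentSalaryHike": [12, 14, 21, 23, 13, 22],
})

summary = data.groupby("PerformanceRating")["PercentSalaryHike"].agg(["mean", "min", "max"])
print(summary)
```

If the mean hike rises cleanly with the rating level, that confirms what the boxplots suggest: salary hikes track performance ratings.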
Observations-